Document Clustering at NTCIR-4 Workshop: Limiting Search Space of the K-Means Method Using Word Occurrence
Abstract
In this paper, we propose a new document clustering method based on the K-means method (kmeans). In our method, only a finite set of candidate vectors may serve as the representative vectors of kmeans. We also propose a method for constructing these candidate vectors from documents that share the same word. We participated in NTCIR-4 WEB Task D (Topic Classification Task) and experimentally compared our method with kmeans on this task.
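The abstract gives only the outline of the method, so the following is a minimal sketch of one way the idea could be realized, not the authors' implementation. It assumes a non-negative document-term matrix with no all-zero rows, builds one candidate per word as the normalized mean of the documents containing that word, and in each update step snaps every representative to the candidate most similar (by cosine) to the cluster mean. The function names, the cosine similarity, the de-duplication of candidates, and the update rule are all assumptions made for illustration.

```python
import numpy as np

def candidate_vectors(X):
    """Assumed construction: one candidate per word, the unit-length mean
    of all documents that contain that word."""
    candidates = []
    for w in range(X.shape[1]):
        docs = X[X[:, w] > 0]          # documents sharing word w
        if len(docs) > 0:
            mean = docs.mean(axis=0)
            candidates.append(mean / np.linalg.norm(mean))
    # Words occurring in exactly the same documents yield identical
    # candidates; dropping duplicates is an illustrative choice.
    return np.unique(np.array(candidates), axis=0)

def constrained_kmeans(X, k, n_iter=20, seed=0):
    """K-means whose representatives are restricted to a finite candidate set."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    C = candidate_vectors(X)
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-length documents
    reps = rng.choice(len(C), size=k, replace=False)    # indices into C
    for _ in range(n_iter):
        # Assignment step: most similar representative by cosine similarity.
        labels = np.argmax(Xn @ C[reps].T, axis=1)
        new_reps = reps.copy()
        for j in range(k):
            members = Xn[labels == j]
            if len(members) > 0:
                # Update step: instead of moving to the cluster mean, pick
                # the candidate closest to it (assumed rule).
                new_reps[j] = np.argmax(C @ members.mean(axis=0))
        if np.array_equal(new_reps, reps):
            break
        reps = new_reps
    labels = np.argmax(Xn @ C[reps].T, axis=1)
    return labels, C[reps]
```

Because the representatives can only take values from the candidate set, the search space is finite, and `k` must not exceed the number of distinct candidates. Plain kmeans differs only in the update step, where the representative would be the cluster mean itself.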
Similar resources
Patent Map Generation Using Concept-Based Vector Space Model
This paper proposes a patent map generation system using a concept-based vector space model and presents evaluation results from the NTCIR-4 patent feasibility study (FS) task. The concept-base is a knowledge base of words, which expresses each word as an associated vector. The word vectors are computed based on word co-occurrence in a target document set; therefore, the word vectors reflect targ...
Experiments on Patent Retrieval at NTCIR-4 Workshop
In the Patent Retrieval Task in NTCIR-4 Workshop, the search topic is the claim in a patent document, so we use the claim text and the IPC information for the similarity calculations between the search topic and each patent document in the collection. We examined the effectiveness of the similarity measure between IPCs and the term weighting for the occurrence positions of the keyword attribute...
Fuzzy Clustering Approach Using Data Fusion Theory and its Application To Automatic Isolated Word Recognition
In this paper, the use of clustering algorithms for data fusion at the decision level is proposed. The results of automatic isolated word recognition, which are derived from speech spectrograph and Linear Predictive Coding (LPC) analysis, are combined with each other by using fuzzy clustering algorithms, especially fuzzy k-means and fuzzy vector quantization. Experimental results show that the...
Comparing k-means clusters on parallel Persian-English corpus
This paper compares clusters of aligned Persian and English texts obtained with the k-means method. Text clustering has many applications in various fields of natural language processing. So far, much research has been done on clustering English documents. The question arises whether those results extend to other languages. Since the goal of document clustering is grouping of docum...
Chinese and Korean Topic Search of Japanese News Collections
UC Berkeley participated in the pivot bilingual task of the CLIR track at NTCIR Workshop 4. Our focus was on Chinese and Korean searches against the Japanese News document collection, using English as a pivot language. For comparison of our pivot techniques, we submitted Japanese monolingual and English Japanese bilingual search rankings as well. Two different commercial translation software pa...